Conversation

Compare c567dc6 to c01fa70
Pull request overview
Enables multi-datacenter deployments by expanding the DataCenter enum, allowing datacenter=None (all DCs) or lists, and adding first-class support for attaching multiple network volumes (one per datacenter) with corresponding GraphQL payload updates.
Changes:
- Expand `DataCenter` to all supported RunPod DCs and default endpoint/resource datacenter behavior to `None` (no restriction / all DCs).
- Add multi-volume support (accept a single volume or a list; validate one volume per DC; send `networkVolumeIds` when multiple are attached).
- Add CPU datacenter restrictions via the exported `CPU_DATACENTERS`, with validation and clearer errors.
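As a rough illustration of the new input shapes (a minimal sketch; the enum values shown and the helper name `normalize_datacenter` are assumptions for illustration, not the SDK's actual code), normalizing the `datacenter` parameter to an optional list might look like:

```python
from enum import Enum
from typing import Optional, Union

class DataCenter(str, Enum):
    # Hypothetical subset of the expanded enum; values mirror RunPod DC IDs.
    EU_RO_1 = "EU-RO-1"
    US_TX_1 = "US-TX-1"
    EU_SE_1 = "EU-SE-1"

def normalize_datacenter(
    value: Union[None, str, DataCenter, list],
) -> Optional[list]:
    """Accept a single DC, a string, a list, or None (no restriction)."""
    if value is None:
        return None  # no location restriction: all datacenters
    if isinstance(value, DataCenter):
        return [value]
    if isinstance(value, str):
        return [DataCenter(value)]  # look up enum member by its value
    return [v if isinstance(v, DataCenter) else DataCenter(v) for v in value]

assert normalize_datacenter(None) is None
assert normalize_datacenter("EU-RO-1") == [DataCenter.EU_RO_1]
assert normalize_datacenter(DataCenter.US_TX_1) == [DataCenter.US_TX_1]
assert normalize_datacenter(["EU-SE-1", DataCenter.EU_RO_1]) == [
    DataCenter.EU_SE_1,
    DataCenter.EU_RO_1,
]
```

Because `DataCenter` mixes in `str`, the `isinstance` checks must test the enum before the plain-string case.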
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_p2_gaps.py | Updates expectations for default locations/datacenter behavior when unset. |
| tests/unit/test_endpoint.py | Adds coverage for volume normalization to lists and expanded datacenter input forms. |
| tests/unit/test_deprecations.py | Marks CPU_DATACENTERS as non-deprecated public API. |
| tests/unit/resources/test_serverless.py | Updates/extends tests for multi-DC datacenter normalization, locations syncing, multi-volume validation, and CPU DC validation. |
| tests/unit/resources/test_network_volume.py | Adds tests for NetworkVolume(datacenter=...) alias behavior. |
| src/runpod_flash/runtime/resource_provisioner.py | Passes locations from manifest into reconstructed resources. |
| src/runpod_flash/endpoint.py | Accepts volume as single or list; forwards datacenter as-is; serializes multi-volume configs into networkVolumes. |
| src/runpod_flash/core/resources/serverless_cpu.py | Excludes networkVolumes from CPU endpoint payload. |
| src/runpod_flash/core/resources/serverless.py | Adds multi-volume normalization/validation; datacenter normalization; CPU DC validation; deploy/update inject networkVolumeIds. |
| src/runpod_flash/core/resources/network_volume.py | Expands DataCenter; adds from_string; exports CPU_DATACENTERS; adds datacenter alias for dataCenterId. |
| src/runpod_flash/core/resources/load_balancer_sls_resource.py | Excludes networkVolumes from CPU load balancer payload. |
| src/runpod_flash/core/resources/init.py | Re-exports CPU_DATACENTERS. |
| src/runpod_flash/core/api/runpod.py | Extends saveEndpoint selection set to include networkVolumeIds. |
| src/runpod_flash/cli/commands/build_utils/manifest.py | Persists locations into the build manifest deployment config. |
| src/runpod_flash/init.py | Exposes CPU_DATACENTERS at the top-level package API. |
| docs/Flash_SDK_Reference.md | Documents expanded DC set, datacenter=None, CPU_DATACENTERS, and multi-volume examples. |
| docs/Flash_Deploy_Guide.md | Updates persistent storage examples for explicit DC targeting and multi-volume usage. |
Comments suppressed due to low confidence (2)
src/runpod_flash/cli/commands/build_utils/manifest.py:242
- Manifest extraction still only serializes the singular `networkVolume` (and `networkVolumeId`), so endpoints configured with multiple volumes (`networkVolumes`) will be written to the manifest with only the first volume. This will break deploy/runtime provisioning for multi-volume configs. Extend manifest extraction to include `networkVolumes` when present (and ensure runtime reconstruction supports it).
```python
if hasattr(resource_config, "locations") and resource_config.locations:
    config["locations"] = resource_config.locations
if hasattr(resource_config, "env") and resource_config.env:
    env_dict = dict(resource_config.env)
    env_dict.pop("RUNPOD_API_KEY", None)
    if env_dict:
        config["env"] = env_dict
if hasattr(resource_config, "networkVolume") and resource_config.networkVolume:
    nv = resource_config.networkVolume
    nv_config = {"name": nv.name}
    if nv.size is not None:
        nv_config["size"] = nv.size
    if hasattr(nv, "dataCenterId") and nv.dataCenterId is not None:
        nv_config["dataCenterId"] = (
            nv.dataCenterId.value
            if hasattr(nv.dataCenterId, "value")
            else nv.dataCenterId
        )
    config["networkVolume"] = nv_config
```
src/runpod_flash/runtime/resource_provisioner.py:157
- `create_resource_from_manifest` can reconstruct only `networkVolume`/`networkVolumeId`, but multi-volume endpoints are now represented via `networkVolumes` (and deployed via `networkVolumeIds`). Without handling those keys, manifest-based provisioning will silently drop additional volumes. Add support for `networkVolumes` (and/or `networkVolumeIds`) reconstruction here.
```python
if "locations" in resource_data:
    deployment_kwargs["locations"] = resource_data["locations"]
# Reconstruct NetworkVolume from manifest data if present
if "networkVolume" in resource_data:
    from runpod_flash.core.resources.network_volume import NetworkVolume

    nv_data = resource_data["networkVolume"]
    deployment_kwargs["networkVolume"] = NetworkVolume(**nv_data)
elif "networkVolumeId" in resource_data:
    deployment_kwargs["networkVolumeId"] = resource_data["networkVolumeId"]
```
```python
# CPU endpoints don't use GPU-specific fields, so exclude them from API payload
# This prevents the RunPod GraphQL API from rejecting CPU endpoints with GPU-specific fields
# Note: instanceIds is NOT in _input_only, so it will be sent to the API
_input_only = {
    "id",
    "cudaVersions",  # GPU-specific, exclude from API payload
    "datacenter",
    "env",
    "gpus",  # Inherited from parent, but always None for CPU endpoints
    "gpuIds",  # GPU-specific API field, exclude from payload
    "gpuCount",  # GPU-specific API field, exclude from payload
    "allowedCudaVersions",  # GPU-specific API field, exclude from payload
    "flashboot",
    "flashEnvironmentId",
    "imageName",
    "networkVolume",
    "networkVolumes",
    "python_version",
}
```
`CpuServerlessEndpoint` now excludes `networkVolumes` from the API payload via `_input_only`, but the overridden `config_hash` below only includes `networkVolume` (singular). That means changing the multi-volume configuration won't change `config_hash`, so drift detection/updates may not trigger. Include `networkVolumes` in the CPU config hash (or otherwise ensure multi-volume changes affect drift detection).
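A toy illustration of why the hashed field set matters for drift detection (a generic sketch; `config_hash` here is a stand-in, not the SDK's actual implementation):

```python
import hashlib
import json

def config_hash(config: dict, hashed_fields: set) -> str:
    # Hash only the selected fields, in a stable order.
    payload = {k: config.get(k) for k in sorted(hashed_fields)}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

one_volume = {"networkVolume": "vol-a", "networkVolumes": ["vol-a"]}
two_volumes = {"networkVolume": "vol-a", "networkVolumes": ["vol-a", "vol-b"]}

# Hashing only the singular field misses the added second volume...
assert config_hash(one_volume, {"networkVolume"}) == config_hash(
    two_volumes, {"networkVolume"}
)
# ...while including the plural field detects the drift.
assert config_hash(one_volume, {"networkVolume", "networkVolumes"}) != config_hash(
    two_volumes, {"networkVolume", "networkVolumes"}
)
```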
```python
_input_only = {
    "id",
    "cudaVersions",
    "datacenter",
    "env",
    "flashboot",
    "flashEnvironmentId",
    "gpus",
    "gpuIds",
    "gpuCount",
    "allowedCudaVersions",
    "imageName",
    "networkVolume",
    "networkVolumes",
    "python_version",
}
```
`CpuLoadBalancerSlsResource` now excludes `networkVolumes` from the API payload via `_input_only`, but its custom `config_hash` implementation later in the class only hashes `networkVolume` (singular). Multi-volume changes may not be detected as drift, preventing needed updates. Include `networkVolumes` in the CPU LB config hash (or ensure multi-volume affects drift detection).
```python
from runpod_flash import NetworkVolume, DataCenter

# existing volume by ID
vol = NetworkVolume(id="vol_abc123")
```
The docs show `vol = NetworkVolume(id="vol_abc123")`, but `NetworkVolume` currently requires a `name` field (no default) in the model. Either update the example to include `name=...` when referencing an existing volume, or make `name` optional when `id` is provided.
```diff
-vol = NetworkVolume(id="vol_abc123")
+vol = NetworkVolume(id="vol_abc123", name="my-existing-volume")
```
```diff
 async def _ensure_network_volume_deployed(self) -> None:
-    """
-    Ensures network volume is deployed and ready if one is specified.
-    Updates networkVolumeId with the deployed volume ID.
-    """
+    """Ensures all network volumes are deployed.
+
+    Deploys each volume in networkVolumes and collects their IDs.
+    Sets networkVolumeId (singular) for backward compat with the first volume.
+    Populates _deployed_volume_ids for multi-volume API payloads.
+    """
+    self._deployed_volume_ids: list[str] = []

     if self.networkVolumeId:
-        return
+        self._deployed_volume_ids.append(self.networkVolumeId)
```
`_ensure_network_volume_deployed` assigns `self._deployed_volume_ids` on a Pydantic model without declaring it as a field/private attr. With the current `BaseResource.model_config` (no `extra='allow'`), setting an undeclared attribute will raise at runtime, breaking deploy/update. Define `_deployed_volume_ids` as a Pydantic `PrivateAttr` (or stash it via `self.__dict__[...]`) so it can be set safely.
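A stripped-down illustration of the failure mode (plain-Python stand-in for Pydantic's attribute check; in the real code the clean fix would be declaring the attribute with `pydantic.PrivateAttr`):

```python
class StrictModel:
    """Minimal stand-in for a Pydantic model that rejects undeclared
    attributes, mirroring model_config without extra='allow'."""

    _declared = {"networkVolumeId"}

    def __init__(self):
        # Bypass __setattr__ for the declared field's default.
        self.__dict__["networkVolumeId"] = None

    def __setattr__(self, name, value):
        if name not in self._declared:
            raise ValueError(f"object has no field {name!r}")
        self.__dict__[name] = value

m = StrictModel()
try:
    m._deployed_volume_ids = []  # undeclared attribute: raises, like Pydantic
    raised = False
except ValueError:
    raised = True
assert raised

# Stashing directly in __dict__ bypasses the check (PrivateAttr is the
# idiomatic Pydantic equivalent of this workaround).
m.__dict__["_deployed_volume_ids"] = ["vol-1"]
assert m._deployed_volume_ids == ["vol-1"]
```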
src/runpod_flash/endpoint.py (Outdated)
```python
if isinstance(volume, NetworkVolume):
    return [volume]
if isinstance(volume, list):
    return volume or None
```
`_normalize_volumes` accepts a list but doesn't validate that every element is a `NetworkVolume`. If a caller passes a mixed/incorrect list, the failure will happen later (e.g., when calling `model_dump`), producing a less clear error. Consider validating list element types here and raising a `ValueError` that points to the bad element.
```diff
-    return volume or None
+    if not volume:
+        # preserve existing behavior: empty list normalizes to None
+        return None
+    for idx, vol in enumerate(volume):
+        if not isinstance(vol, NetworkVolume):
+            raise ValueError(
+                "volume list elements must be NetworkVolume; "
+                f"element at index {idx} is {type(vol).__name__}"
+            )
+    return volume
```
runpod-Henrik left a comment
1. Multi-volume drift detection blind spot

`_hashed_fields` includes `"networkVolumeId"` (singular) but not `"networkVolumes"`. After this PR, `_ensure_network_volume_deployed` sets `networkVolumeId` from the first volume only. Adding or removing a second volume doesn't change `networkVolumeId` → drift detection misses it → the endpoint doesn't redeploy.

2. Hash break for all existing endpoints

`_hashed_fields` includes `"datacenter"`. Before this PR, `datacenter` stored a `DataCenter` enum (defaulting to `EU_RO_1`). After, it stores `Optional[List[DataCenter]]`. Any existing state file has the old enum representation hashed in. On first re-run after upgrade, every endpoint without an explicit datacenter will see a hash mismatch → forced redeploy. Worth calling out in the changelog: users will see unexpected redeployments without understanding why.
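A tiny sketch of why the representation change breaks existing hashes (the values and `hash_value` helper are illustrative stand-ins, not the SDK's state-file format):

```python
import hashlib
import json

def hash_value(value) -> str:
    # Stand-in for hashing one entry of _hashed_fields.
    return hashlib.sha256(json.dumps(value).encode()).hexdigest()

# Before the PR an unset datacenter was stored as the default enum's value;
# after the PR it is stored as None.
old_repr = "EU-RO-1"
new_repr = None

# The hashed representation differs, so every endpoint without an explicit
# datacenter sees a mismatch on its first run after upgrading.
assert hash_value(old_repr) != hash_value(new_repr)
```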
3. Testing gap
The validation logic (single vs. list normalization, CPU datacenter constraint, duplicate DC check, volume-DC mismatch) is well tested. But the async deploy path — the core of the new feature — has no coverage:
| Path | Coverage |
|---|---|
| `_ensure_network_volume_deployed` with multiple volumes | None |
| `_deployed_volume_ids` population across mixed `is_created` / not-created volumes | None |
| `networkVolumeIds` payload injected into `_do_deploy` | None |
| `networkVolumeIds` payload injected into `update` | None |

Everything from "volumes are validated at creation" to "GraphQL receives `networkVolumeIds`" is untested. That's the entire runtime path for the feature.
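One way to start closing this gap is an async unit test against fakes; this sketch assumes nothing about the real classes beyond the behavior described in this review (every name here is a stand-in, not the SDK's API):

```python
import asyncio

class FakeVolume:
    """Stand-in for NetworkVolume with just the fields the test needs."""

    def __init__(self, vol_id: str, created: bool):
        self.id = vol_id
        self.is_created = created

    async def deploy(self) -> str:
        self.is_created = True
        return self.id

class FakeEndpoint:
    """Stand-in modeling the multi-volume deploy behavior under test."""

    def __init__(self, volumes):
        self.networkVolumes = volumes
        self.networkVolumeId = None
        self._deployed_volume_ids = []

    async def _ensure_network_volume_deployed(self) -> None:
        for vol in self.networkVolumes:
            if not vol.is_created:
                await vol.deploy()
            self._deployed_volume_ids.append(vol.id)
        # Backward compat: singular field holds the first volume's ID.
        self.networkVolumeId = self._deployed_volume_ids[0]

async def main():
    # Mixed already-created / not-yet-created volumes.
    ep = FakeEndpoint([FakeVolume("vol-a", True), FakeVolume("vol-b", False)])
    await ep._ensure_network_volume_deployed()
    assert ep._deployed_volume_ids == ["vol-a", "vol-b"]
    assert ep.networkVolumeId == "vol-a"

asyncio.run(main())
```

Equivalent tests against the real classes (with the GraphQL client mocked) would cover the `_deployed_volume_ids` population and the `networkVolumeIds` payload injection paths listed above.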
Nits
- `_normalize_volumes([])` in `endpoint.py` silently returns `None` instead of raising; a user passing an empty list gets no error.
- The `isinstance(self.datacenter, list) else [self.datacenter]` branch in `sync_input_fields` is unreachable; `normalize_datacenter` always produces a list before any model validator runs.
Verdict
Two real issues worth fixing before merge: the drift detection gap (#1) and test coverage for the async deploy path (#3). The hash break (#2) is an unavoidable consequence of the design change but should be documented.
flash is hardcoded to EU-RO-1. The `DataCenter` enum only has one value, `NetworkVolume.dataCenterId` is frozen to it, and `ServerlessResource.datacenter` defaults to it with a prod-only guard that syncs it to `locations`.

`DataCenter` now has all 10 listed RunPod DCs. The `datacenter` parameter on both `Endpoint` and `ServerlessResource` accepts a single DC, a list, a string, or `None` (no location restriction, matching the console's default "all datacenters" behavior). The old default of EU-RO-1 is replaced with `None`. `Endpoint(volume=...)` accepts a single `NetworkVolume` or a list. Network volumes are tied to individual DCs, so a one-volume-per-datacenter constraint is enforced in `ServerlessResource.normalize_network_volumes`. When multiple volumes are attached, the deploy path sends `networkVolumeIds` to the GraphQL API instead of the singular `networkVolumeId`.

CPU availability is limited to a subset of DCs. `CPU_DATACENTERS` (EU-RO-1, US-TX-1, EU-SE-1) is exported and validated against in `validate_cpu_datacenters` so you get a clear error.

Closes AE-2422
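The one-volume-per-datacenter constraint described above could be sketched like this (hypothetical helper operating on plain tuples; not the actual `normalize_network_volumes` implementation):

```python
def validate_one_volume_per_dc(volumes):
    """Reject configurations where two volumes target the same datacenter.

    `volumes` is a list of (volume_name, datacenter_id) pairs; returns a
    mapping of datacenter -> volume name when the constraint holds.
    """
    seen = {}
    for vol_name, dc in volumes:
        if dc in seen:
            raise ValueError(
                f"multiple volumes target datacenter {dc}: "
                f"{seen[dc]!r} and {vol_name!r}"
            )
        seen[dc] = vol_name
    return seen

# One volume per DC is accepted; a duplicate DC raises a clear error.
assert validate_one_volume_per_dc(
    [("vol-a", "EU-RO-1"), ("vol-b", "US-TX-1")]
) == {"EU-RO-1": "vol-a", "US-TX-1": "vol-b"}
```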